Created by: Felipe Rodríguez.

Purpose: run linear and machine learning models to identify the predictors of migration.

Data

We use different datasets from the World Bank and the Migration Policy Institute. We use variables that measure economic performance, inequality, life expectancy, population, remittances and aid. We analyze only the three countries of the Northern Triangle: El Salvador, Guatemala and Honduras. Within Latin America, these countries have the highest levels of emigration to the U.S. after Mexico.

World Bank data: governance, trade, labor conditions, economic performance, education and life expectancy

Migration Policy Institute: migration from the countries of the Northern Triangle of Central America

Time Series from 1980-2015

Models and robustness checks

OLS, Ridge, Lasso, Elastic Net, Regression Tree, Random Forest, PCA and KNN

Findings

Our dependent variable has missing values, which may affect the results of the different estimations we run. However, migration data from these countries is scarce, and we are interested in analyzing this topic even with the issues described. The results are still informative.

This is merely an exploratory attempt to compare linear and machine learning models on the predictors of migration in the Northern Triangle of Central America. Further analysis should try to use better secondary data and fix the collinearity issues described below.

According to the different models we ran, the best predictors are indicators of economic performance, population growth, remittances, trade, death rates and inequality.

1. Data Cleaning

Here we read, review and clean each dataset separately, then merge them and choose our main variables. We drop variables with too many missing observations and, for the remaining variables, replace the missing values with the mean. Finally, we convert the variables to more appropriate formats.

Replacing missing values with the mean of each decade
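The decade-mean replacement could look roughly like the sketch below. The column names (`country`, `year`, `gdp_growth`), the helper name `fill_with_decade_mean`, and the choice to compute the mean within each country are assumptions made for illustration; the notebook's actual columns and grouping may differ.

```python
import numpy as np
import pandas as pd

def fill_with_decade_mean(df, year_col="year", group_cols=("country",)):
    """Fill NaNs in numeric columns with the mean of the same group and decade.

    Hypothetical helper: column names and per-country grouping are assumptions.
    """
    out = df.copy()
    # 1987 -> 1980, 1993 -> 1990, and so on.
    decade = ((out[year_col] // 10) * 10).rename("decade")
    keys = [out[c] for c in group_cols] + [decade]
    value_cols = out.select_dtypes("number").columns.drop(year_col)
    for col in value_cols:
        decade_mean = out.groupby(keys)[col].transform("mean")
        out[col] = out[col].fillna(decade_mean)
    return out

# Toy data, not the real World Bank series.
toy = pd.DataFrame({
    "country": ["GTM"] * 4,
    "year": [1980, 1985, 1990, 1995],
    "gdp_growth": [2.0, np.nan, 4.0, np.nan],
})
filled = fill_with_decade_mean(toy)
```

A decade with no observed values at all stays missing under this approach, so such cases would still need separate handling.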

The datasets are now merged and cleaned, so we can start the analysis.

2. Checking for correlations across our variables

Model

We drop variables with high correlation
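One common way to drop highly correlated variables is to scan the upper triangle of the absolute correlation matrix and remove the later column of each highly correlated pair. The sketch below assumes a cutoff of 0.9 and a hypothetical helper name `drop_highly_correlated`; the cutoff used in the notebook is not stated.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from every pair whose |correlation| exceeds threshold.

    The 0.9 default is an assumption, not the notebook's value.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: "b" is an exact multiple of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
reduced, dropped = drop_highly_correlated(toy)
```

Which member of a correlated pair is dropped depends only on column order here, so the columns worth keeping should be placed first.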

We define a new variable for time variation

We have reduced the correlation in our dataset. Now we want to see which variables are the best predictors of migration, so we review our variables one last time.

3. Predictions

4. Linear and Machine Learning Models

4.1 OLS

4.2 Regularization

4.2.1 Ridge

4.2.2 Lasso

4.2.3 Elastic Net

Best Predictors

We find similar results with lasso

We find similar results with elastic net
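Ranking predictors with regularized models usually means standardizing the inputs, fitting the model, and sorting the non-zero coefficients by absolute size. The sketch below does this for lasso and elastic net on synthetic data. The feature names, the helper `rank_predictors`, the 5-fold cross-validation and `l1_ratio=0.5` are all assumptions for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def rank_predictors(model, X, y):
    """Fit on standardized X and return non-zero |coefficients|, largest first."""
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X, y)
    coefs = pd.Series(pipe[-1].coef_, index=X.columns)
    return coefs[coefs != 0].abs().sort_values(ascending=False)

# Synthetic stand-in: 108 rows (3 countries x 36 years), hypothetical names.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(108, 4)),
    columns=["remittances", "gdp_growth", "trade", "noise"],
)
y = 3.0 * X["remittances"] - 1.5 * X["gdp_growth"] + rng.normal(scale=0.5, size=108)

lasso_rank = rank_predictors(LassoCV(cv=5), X, y)
enet_rank = rank_predictors(ElasticNetCV(l1_ratio=0.5, cv=5), X, y)
```

Standardizing first matters because the penalty acts on coefficient size, so unscaled variables would be penalized unevenly.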

4.3 Regression Tree

With the regression tree, the findings suggest that remittances and bilateral aid are also relevant predictors.

Our R2 is too high; we may still have some issues with correlation among our predictors.
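Besides correlated predictors, an R2 that looks too high can also come from scoring the tree on the same rows it was fitted on, since a fully grown tree can reproduce its training data almost exactly. A minimal check, on synthetic data and with an assumed 70/30 split, is to compare the in-sample score with the score on held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; only the first column carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=108)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Default settings grow the tree until the leaves are pure.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_r2 = tree.score(X_train, y_train)  # R2 on the rows used for fitting
test_r2 = tree.score(X_test, y_test)     # R2 on rows the tree never saw
```

A large gap between the two scores points to overfitting rather than collinearity, and limiting `max_depth` or `min_samples_leaf` is the usual remedy.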

4.4 Random Forest

Remittances might be collinear with the levels of migration, which might be inflating our R2. Economic performance, population, death rates and inequality are important according to our results.

Our random forest findings seem to match our intuitive expectations about the best predictors. Our R2 is still too high, however, and the high relevance of remittances as a predictor suggests that it might be causing collinearity in our model.
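One way to examine the role of remittances is to look at the forest's feature importances and then compare cross-validated R2 with and without that predictor. The sketch below uses synthetic data in which a hypothetical `remittances` column dominates by construction; the feature names, forest size and fold settings are assumptions, not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(108, 3)),
    columns=["remittances", "gdp_growth", "population_growth"],
)
y = 2.0 * X["remittances"] + 0.5 * X["gdp_growth"] + rng.normal(scale=0.3, size=108)

def cv_r2(features):
    """Mean out-of-fold R2 of a random forest on the given feature set."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, features, y, cv=folds, scoring="r2").mean()

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

r2_with = cv_r2(X)
r2_without = cv_r2(X.drop(columns="remittances"))
```

If R2 stays plausible without the suspect variable, the remaining predictors carry real signal; if it collapses, most of the fit was coming from that one variable.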

5. PCA

Now we are going to use PCA and KNN in our estimations.

Subset of the data without the threshold

Standardize the variables

Fit a PCA on the standardized data

Plot the explained variance ratio of the components
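The standardize, fit and inspect steps above can be sketched as follows. The data here is synthetic (two columns are made nearly collinear on purpose), so the numbers only illustrate the mechanics; the arrays `ratios` and `cumulative` are what one would pass to a plotting call.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix (threshold column excluded).
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=108)  # near-duplicate column

# PCA is scale-sensitive, so put every variable on mean 0, variance 1.
Z = StandardScaler().fit_transform(X)

pca = PCA().fit(Z)
ratios = pca.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)          # running total, useful for a scree plot
scores = pca.transform(Z)               # PC1, PC2, ... for each row
```

The first columns of `scores` are the PC1, PC2 and PC3 values that a pairplot would display.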

Plot a seaborn pairplot of PC1, PC2, and PC3 with hue='threshold'

Horn's parallel analysis

Run parallel analysis for the migration data

Plot the migration eigenvalues (.explained_variance_) against the parallel analysis random eigenvalue cutoffs
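Horn's parallel analysis compares the observed eigenvalues with eigenvalues obtained from random data of the same shape, and keeps the leading components that exceed the random cutoffs. The sketch below is one plain NumPy version of that idea; the function name, the 200 simulations and the 95th-percentile cutoff are assumptions, and the data is synthetic with two built-in latent factors.

```python
import numpy as np

def parallel_analysis(Z, n_iter=200, percentile=95, seed=0):
    """Return observed eigenvalues, random cutoffs, and components to keep."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    # Eigenvalues of the correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.normal(size=(n, p))  # uncorrelated data of the same shape
        eigs = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
        random_eigs[i] = np.sort(eigs)[::-1]
    cutoffs = np.percentile(random_eigs, percentile, axis=0)
    above = observed > cutoffs
    # Count leading components that beat their cutoff (stop at first failure).
    n_keep = p if above.all() else int(np.argmin(above))
    return observed, cutoffs, n_keep

# Synthetic data: six variables driven by two independent latent factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(108, 2))
loadings = np.repeat(np.eye(2), 3, axis=1)  # vars 0-2 <- factor 1, 3-5 <- factor 2
Z = factors @ loadings + 0.5 * rng.normal(size=(108, 6))

observed, cutoffs, n_keep = parallel_analysis(Z)
```

Plotting `observed` and `cutoffs` on the same axes shows where the observed curve drops below the random one.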

Predict "threshold" from original data and from PCA

We found very similar results with the KNN and logistic estimation

Perform stratified cross-validation on a KNN classifier and logistic regression.

We found more accurate results using a stratified cross validation in both KNN and the logistic estimation
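A stratified cross-validation of both classifiers, on the original variables and on a reduced set of principal components, might be set up as below. The data is generated with `make_classification` as a stand-in for the `threshold` target; the number of folds, neighbors and retained components are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary target standing in for "threshold".
X, y = make_classification(
    n_samples=108, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

# Stratified folds keep the class balance roughly equal in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "knn_original": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "knn_pca": make_pipeline(
        StandardScaler(), PCA(n_components=3), KNeighborsClassifier(n_neighbors=5)
    ),
    "logit_original": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "logit_pca": make_pipeline(
        StandardScaler(), PCA(n_components=3), LogisticRegression(max_iter=1000)
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    for name, model in models.items()
}
```

Putting the scaler and PCA inside each pipeline means they are refitted on the training folds only, which avoids leaking information from the held-out fold.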

We have a significant increase in accuracy doing the parallel estimation of KNN and PC

Confusion matrix for each of our classification methods.

Our confusion matrices suggest that the parallel estimation of KNN and PC gives more accurate estimations.
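Confusion matrices for each classifier can be built from out-of-fold predictions, so every row is classified by a model that did not see it during fitting. The sketch below again uses synthetic data in place of the `threshold` target, and the classifier settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary target standing in for "threshold".
X, y = make_classification(
    n_samples=108, n_features=8, n_informative=3, n_redundant=2, random_state=0
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

classifiers = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Rows are true classes, columns are predicted classes.
matrices = {
    name: confusion_matrix(y, cross_val_predict(clf, X, y, cv=folds))
    for name, clf in classifiers.items()
}
```

The off-diagonal cells show which kind of error each method makes, which accuracy alone does not reveal.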